Abstract: When a piece of information (microblog, photograph, video, link, etc.) starts to spread in a social network, an important question arises: will it spread to “viral” proportions, where “viral” is defined as an order-of-magnitude increase? However, several previous studies have established that cascade size and frequency are related through a power law, which leads to a severe imbalance in this classification problem. In this paper, we devise a suite of measurements based on “structural diversity”: the variety of social contexts (communities) in which individuals partaking in a given cascade engage. We demonstrate that these measures are able to distinguish viral from non-viral cascades, despite the severe imbalance of the data for this problem. Further, we leverage these measurements as features in a classification approach, successfully predicting microblogs that grow from 50 to 500 reposts with precision of 0.69 and recall of 0.52 for the viral class, despite this class comprising under 2% of samples. This significantly outperforms our baseline approach as well as the current state-of-the-art. Our work also demonstrates how we can trade off between precision and recall.

I. Introduction 

When a piece of information (microblog, photograph, video, link, etc.) starts to spread in a social network, an important question arises: will it spread to “viral” proportions, where “viral” is defined as an order-of-magnitude increase? Several previous studies [1], [2] have established that cascade size and frequency are related through a power law, which leads to a severe imbalance in this classification problem. In this paper, we devise a suite of measurements based on “structural diversity” that are associated with the growth of a viral cascade in a social network. Structural diversity refers to the variety of social contexts in which an individual engages and is typically instantiated (for social networks) as the number of distinct communities represented in an individual’s local neighborhood. Previously, Ugander et al. identified a correlation between structural diversity and influence [?]. We demonstrate that these measures are able to distinguish viral from non-viral cascades, despite the severe imbalance of the data for this problem. Further, we leverage these measurements as features in a classification approach, successfully predicting microblogs that grow from 50 to 500 reposts with precision of 0.69 and recall of 0.52 for the viral class (under 2% of the samples).

We note that our results on the prediction of cascades rely solely upon the use of our structural diversity based measures for features and limited temporal features; hence the prediction is based on network topology alone (no content information was utilized). We also achieved these results while maintaining the imbalances of the dataset, which we felt better mimics reality. This differs from some previous studies, which balance the data before conducting classification. Further, we note that we obtained prediction of order-of-magnitude increases in the size of the cascade, which also differs from other work (e.g., [1]) that focuses on identifying cascades that double in size.

II. Technical Preliminaries 

Here we introduce necessary notation and describe our social network data. We represent a social network as a graph $G = (V, E)$, where $V$ is the set of vertices and $E$ is the set of directed edges, with sizes $|V|$ and $|E|$ respectively. The intuition behind edge $(v, v')$ is that node $v$ can influence $v'$. This intuition stems from how we create the edges in our network: $(v, v')$ is an edge if during a specified time period there is at least one microblog posted by $v$ that is reposted by $v'$ (we leave other thresholds beyond 1 repost to future work). We shall also assume a partition over nodes that specifies a community structure. We shall assume that such a partition is static (based on the same time period from which the edges were derived) and that the partition $C$ consists of $k$ communities: $\{C_1, C_2, \ldots, C_k\}$. There are many possible methods to derive the communities (if user-reported communities are not available). We utilize the Louvain algorithm to identify our communities in this paper due to its ability to scale.
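The edge rule above can be illustrated with a minimal sketch. It assumes a hypothetical input of (author, reposter) pairs, one pair per repost observed in the network-building period; the record layout, the function name `build_edges` and the `min_reposts` parameter are illustrative and are not taken from the dataset's actual schema.

```python
from typing import Dict, Iterable, Set, Tuple

# Hypothetical record layout: (author, reposter), one tuple per repost.
Repost = Tuple[str, str]
Edge = Tuple[str, str]

def build_edges(reposts: Iterable[Repost], min_reposts: int = 1) -> Set[Edge]:
    """Return directed edges (v, v') where v' reposted v at least min_reposts times."""
    counts: Dict[Edge, int] = {}
    for author, reposter in reposts:
        counts[(author, reposter)] = counts.get((author, reposter), 0) + 1
    return {edge for edge, n in counts.items() if n >= min_reposts}

# Made-up example: b reposts a twice, c reposts b once.
reposts = [("a", "b"), ("a", "b"), ("b", "c")]
edges = build_edges(reposts)
```

With `min_reposts=1` this matches the rule used in the paper; larger values correspond to the stricter thresholds that the text leaves to future work.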

Cascades. For a given microblog $\theta$, we denote the subset of the first $m$ nodes who originally posted or reposted $\theta$ as $V_\theta^m$ and refer to them as adopters (at size $m$). Likewise, the set of reposting relationships within the same time period will be denoted $R_\theta^m$. Taken together, we have a cascade: $D_\theta^m = (V_\theta^m, R_\theta^m)$. Any valid original microblog $\theta$ can be treated as a unique identifier for a cascade. Given a microblog $\theta$, $v_\theta$ is the originator at instant $t_\theta$, which is defined as the origin time when the originator posted the microblog $\theta$, and time $t$ is time since $t_\theta$. The $m$th repost of the microblog $\theta$ happens at time $t_\theta^m$. As $m$ increases, a cascade accumulates nodes and edges over time. We shall use $N_\theta$ to denote the final size of a cascade, while the size of a cascade at any particular instant (the number of nodes present at that instant) is simply $|V_\theta^m|$. For a given size $m$, we shall refer to the frontiers as the outgoing neighbors of the adopters in graph $G$ who are not adopters themselves. Formally: $F_\theta^m = \{v \in V \setminus V_\theta^m \text{ s.t. } \exists v_i \in V_\theta^m \text{ where } (v_i, v) \in E\}$.

For nodes in $G$ that are outside the adopters, we shall use the notation $t_{exp}(v, \theta, m)$ to denote the number of time units from the initial post of $\theta$ before the microblog was reposted by one of $v$’s incoming neighbors, intuitively the time at which $v$ was exposed to $\theta$. For a given natural number $\lambda$ (used to specify a time period), we define the $\lambda$ frontiers as the subset of the frontiers that have been exposed to $\theta$ no earlier than $\lambda$ time units previously. Formally this set is defined as follows: $F_\theta^{m,\lambda} = \{v \in F_\theta^m \mid t_{exp}(v, \theta, m) \le \lambda\}$. Finally, the complement of this set within the frontiers are the $\lambda$ non-adopters: $\bar{F}_\theta^{m,\lambda} = \{v \in F_\theta^m \mid t_{exp}(v, \theta, m) > \lambda\}$.
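As a rough illustration of these set definitions, the sketch below computes the frontiers of a toy adopter set and splits them by exposure time. The function names, the toy graph and the exposure values are made up for illustration; the boundary case $t_{exp} = \lambda$ is counted as a $\lambda$ frontier here so that the two sets partition the frontier, which is an assumption about the intended inequality.

```python
from typing import Dict, Hashable, Iterable, Set, Tuple

Node = Hashable
Edge = Tuple[Node, Node]

def frontiers(edges: Iterable[Edge], adopters: Set[Node]) -> Set[Node]:
    """Out-neighbors of adopters that are not adopters themselves."""
    return {v for (u, v) in edges if u in adopters and v not in adopters}

def split_frontiers(front: Set[Node], t_exp: Dict[Node, float], lam: float):
    """Split the frontier into (lambda frontiers, lambda non-adopters)."""
    recent = {v for v in front if t_exp[v] <= lam}
    return recent, front - recent

# Toy graph: a and b have adopted; c and d are exposed but have not.
edges = {("a", "b"), ("a", "c"), ("b", "d"), ("c", "a")}
adopters = {"a", "b"}
front = frontiers(edges, adopters)

# Made-up exposure values (in minutes) for each frontier node.
t_exp = {"c": 10, "d": 45}
lam_frontiers, lam_non_adopters = split_frontiers(front, t_exp, lam=30)
```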

Sina Weibo Dataset. The dataset we used was provided by the WISE 2012 Challenge.¹ It included a sample of microblogs posted on Sina Weibo from 2009 to 2012. In this dataset, we are provided with time and user information for each post and subsequent repost, which enabled us to derive a corpus of cascades. From this data, we derived our social network $G = (V, E)$ (with 17.9 M vertices and 52.4 M edges) from reposts that were published during the 3 month period between May 1, 2011 and July 31, 2011. For this network, the average clustering coefficient is 0.107. There are 4974 connected components in the network. The Louvain algorithm outputs 379,416 communities with an average size of 47.5 for this network. As expected, this network exhibits a power-law degree distribution. For this network, the number of active nodes in August (the time period we studied for cascade prediction) is 5,910,608, of which 5,664,625 have at least one out-neighbor. During the month of August, there were 9,323,294 reposts of 2,252,368 different original microblogs; 1,920,763 (86.6%) of these were written by authors who published at least one microblog between May 1, 2011 and July 31, 2011 (the time period we used to create the underlying network). The average time it took for viral cascades to become viral is approximately 18 hours. The distribution of the final size of cascades resembles a power-law distribution, which suggests that this dataset is representative of cascade behavior observed “in the wild”. This differs significantly from previous works, which conduct biased sampling to artificially provide balanced classes. We selected $\lambda$ as 30 minutes, as 90% of all reposts in the initial 3 month period occurred in under this time.
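The choice of the time window can be sketched as a simple coverage check: pick the smallest candidate window under which at least 90% of repost delays fall. The delay values, the candidate list and the function names below are hypothetical, and how each repost delay is measured is an assumption, since the text does not spell it out.

```python
from typing import Iterable, List, Optional

def coverage(delays_minutes: List[float], lam: float) -> float:
    """Fraction of reposts whose delay is under lam minutes."""
    return sum(d < lam for d in delays_minutes) / len(delays_minutes)

def smallest_window(delays_minutes: List[float],
                    candidates: Iterable[float],
                    target: float = 0.9) -> Optional[float]:
    """Smallest candidate window covering at least `target` of the reposts."""
    for lam in sorted(candidates):
        if coverage(delays_minutes, lam) >= target:
            return lam
    return None

# Made-up repost delays in minutes (nine short delays and one long one).
delays = [1, 2, 3, 5, 8, 12, 20, 25, 29, 240]
chosen = smallest_window(delays, candidates=[10, 30, 60])
```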

Number of communities. For $V' \subseteq V$, the associated communities $C(V')$ are the communities represented by $V'$. Formally: $C(V') = \{C_i \in C \text{ s.t. } V' \cap C_i \neq \emptyset\}$. The cardinality of this set (number of communities) will be denoted $K(V')$. We measure the number of communities represented by the above three populations of nodes, $K(V_\theta^m)$, $K(F_\theta^{m,\lambda})$ and $K(\bar{F}_\theta^{m,\lambda})$, observed at a given cascade size.
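A minimal sketch of this count, assuming the partition is stored as a hypothetical `membership` dictionary mapping each node to a community identifier (the storage format is not specified in the text):

```python
from typing import Dict, Hashable, Iterable, Set

def communities_of(nodes: Iterable[Hashable],
                   membership: Dict[Hashable, int]) -> Set[int]:
    """Communities C(V') that contain at least one node of V'."""
    return {membership[v] for v in nodes}

def num_communities(nodes: Iterable[Hashable],
                    membership: Dict[Hashable, int]) -> int:
    """K(V'): the number of distinct communities represented in V'."""
    return len(communities_of(nodes, membership))

# Made-up partition of five nodes into three communities.
membership = {"a": 0, "b": 0, "c": 1, "d": 2, "e": 2}
k_adopters = num_communities({"a", "b", "c"}, membership)
```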

Gini impurity. For $V' \subseteq V$, the gini impurity $I_G(V')$ is the probability of a node in $V'$ being placed into the incorrect community if assigned a community based on the distribution of communities represented in $V'$. Formally: $I_G(V') = 1 - \sum_{i=1}^{k} \left( \frac{|C_i \cap V'|}{|V'|} \right)^2$. We study the gini impurity of the adopters, $\lambda$ non-adopters and $\lambda$ frontiers for a given cascade size $m$: $I_G(V_\theta^m)$, $I_G(\bar{F}_\theta^{m,\lambda})$, $I_G(F_\theta^{m,\lambda})$. The intuition is to capture a notion of how the communities are distributed amongst the nodes in each of these sets with a single scalar value. We note that the impurity of the adopter set $I_G(V_\theta^m)$ behaves similarly to the entropy of this set (a measurement introduced in [3]). However, as we will see in the next two sections, we found that the impurity of the $\lambda$ frontiers is a more discriminating feature.
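The gini impurity of a node set can be computed directly from community counts, using the standard form $1 - \sum_i p_i^2$ where $p_i$ is the fraction of the set falling in community $C_i$. The sketch below again assumes a hypothetical `membership` dictionary from node to community identifier, and returns 0 for an empty set by convention (a choice made here, not stated in the text).

```python
from collections import Counter
from typing import Dict, Hashable, Iterable

def gini_impurity(nodes: Iterable[Hashable],
                  membership: Dict[Hashable, int]) -> float:
    """1 - sum_i p_i^2, where p_i is the share of nodes in community i."""
    counts = Counter(membership[v] for v in nodes)
    total = sum(counts.values())
    if total == 0:
        return 0.0  # convention for an empty set (assumption)
    return 1.0 - sum((c / total) ** 2 for c in counts.values())

# Made-up partition: two nodes in community 0, one each in 1 and 2.
membership = {"a": 0, "b": 0, "c": 1, "d": 2}
mixed = gini_impurity({"a", "b", "c", "d"}, membership)   # shares 1/2, 1/4, 1/4
pure = gini_impurity({"a", "b"}, membership)              # single community
```

For the mixed set the value is $1 - (0.25 + 0.0625 + 0.0625) = 0.625$, while a set drawn from a single community has impurity 0.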

¹ http://www.wise2012.cs.ucy.ac.cy/challenge.html

Overlap. For $V_a, V_b \subseteq V$, the overlap $O(V_a, V_b)$ is simply the number of shared communities. Formally: $O(V_a, V_b) = |C(V_a) \cap C(V_b)|$. We study overlap between adopters and $\lambda$ frontiers, between adopters and $\lambda$ non-adopters, and between $\lambda$ frontiers and $\lambda$ non-adopters: $O(V_\theta^m, F_\theta^{m,\lambda})$, $O(V_\theta^m, \bar{F}_\theta^{m,\lambda})$, and $O(F_\theta^{m,\lambda}, \bar{F}_\theta^{m,\lambda})$ respectively. The intuition with overlap stems directly from the original structural diversity results of [?]: for instance, a high overlap between adopters and $\lambda$ frontiers may indicate that the $\lambda$ frontiers are linked to adopters with inner-community connections and high structural diversity, hence increasing the probability of adoption.
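The overlap between two node sets reduces to a set intersection over community identifiers. As before, the `membership` dictionary and the toy node sets are illustrative assumptions.

```python
from typing import Dict, Hashable, Iterable

def overlap(nodes_a: Iterable[Hashable],
            nodes_b: Iterable[Hashable],
            membership: Dict[Hashable, int]) -> int:
    """Number of communities represented in both node sets."""
    comms_a = {membership[v] for v in nodes_a}
    comms_b = {membership[v] for v in nodes_b}
    return len(comms_a & comms_b)

# Made-up partition: adopters span communities 0 and 1,
# frontier nodes span communities 1 and 2.
membership = {"a": 0, "b": 1, "c": 1, "d": 2}
shared = overlap({"a", "b"}, {"c", "d"}, membership)
```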

Average time to adoption. The average time to adoption for the nodes in the current set of adopters (once the cascade grows to size $m$): $\frac{1}{m} \sum_{i=1}^{m} t_\theta^i$. We also use average time to adoption as a baseline measure.

III. Results 

Here we examine the behavior of the various structural diversity measurements as viral and non-viral cascades progress. We define a cascade as viral if the number of reposts reaches a threshold (denoted $TH$) of 500 (in the next section we will explore other settings for $TH$ when describing our classification results). We look at snapshots of the cascades as they progress in terms of size (denoted $m$). For $m = \{10, 30, 50, 100, 200\}$, the number of samples is $\{98832, 26733, 13285, 4722, 1324\}$ respectively, of which 208 are viral. With each size $m$ we consider the cascades with $m$ adopters at some time $t_\theta^m$; $t_\theta^m$ can vary for different $\theta$. Hence, cascades with final size $N_\theta < m$ are ignored in our analysis. This leads to a decrease in the number of non-viral cascades as $m$ increases.
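The sampling rule above can be sketched as a filter plus a label: keep only cascades whose final size reaches $m$, and mark a cascade viral when its final size reaches $TH$. The cascade identifiers and sizes below are made up, and treating "reaches" as greater-than-or-equal is an assumption.

```python
from typing import Dict

def snapshot_samples(final_sizes: Dict[str, int], m: int, th: int = 500) -> Dict[str, bool]:
    """Map cascade id -> is_viral for cascades whose final size reaches m."""
    return {cid: n >= th for cid, n in final_sizes.items() if n >= m}

# Made-up final sizes for four cascades.
final_sizes = {"c1": 8, "c2": 60, "c3": 520, "c4": 35}
at_10 = snapshot_samples(final_sizes, m=10)
at_50 = snapshot_samples(final_sizes, m=50)

# Class balance reported in the text for m = 50: 208 viral out of 13,285.
viral_share = 208 / 13285
```

The share works out to roughly 1.6%, consistent with the "under 2%" figure quoted for the viral class.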

Average time to adoption. As a baseline measurement, we study the average time to adoption for each size-based stage of the cascade process (Fig. 1i, Fig. 1j). As expected, viral cascades exhibit a faster rate of reposting. While we note that significant differences are present, especially in the early stages of the cascade, the whiskers of the non-viral class indicate a significant proportion of non-viral cascades that exhibit rapid adoption. We believe this is likely due to the fact that certain cascades may have very high appeal to specialized communities.

Number of communities. Fig. 1a, Fig. 1b, Fig. 1c and Fig. 1d display how the number of communities $K(V')$ increases over $m = \{10, 30, 50, 100, 200\}$ for the sets $V' \in \{V_\theta^m, F_\theta^{m,\lambda}\}$. We note that $K(V_\theta^m)$ (the communities represented in the set of adopters) was shown to be a useful feature in [3] for tasks where the target class had fewer reposts than in this study. Here, we note that while statistically significant differences exist, the average and median values at each of the examined stages are generally similar. On the other hand, the number of communities represented by the set of $\lambda$ frontiers, $K(F_\theta^{m,\lambda})$, shows that viral cascades have a stronger capability than non-viral ones to keep a diverse set of $\lambda$ frontiers. We also noted that the median of $K(\bar{F}_\theta^{m,\lambda})$ (not pictured) shows that viral cascades start with a smaller $K(\bar{F}_\theta^{m,\lambda})$. However, it increases faster in viral cascades as nodes in the $\lambda$ frontiers become $\lambda$ non-adopters.

Gini impurity. Cascades in both classes tend to accumulate diversity in the process of collecting more adopters, and we have also noted that a related entropy measure (studied in [3]) performed similarly. We also noted (not pictured) that in the early stages, viral cascades can show more diversity in $\lambda$ frontiers as measured by $I_G(F_\theta^{m,\lambda})$ ($m = \{10, 30, 50\}$). But, perhaps most strikingly, non-viral cascades gain more uniformly distributed nodes over communities in $\lambda$ non-adopters, as shown by $I_G(\bar{F}_\theta^{m,\lambda})$ (Fig. 1g, Fig. 1h). We believe that this is because non-viral cascades likely have an appeal limited to a relatively small number of communities; hence those not adopting the trend may represent a more diverse set of communities.

[Fig. 1 plot annotations at m = 10, 30, 50, 100, 200 (M: median, A: average)]
M: 8.0 18.0 25.0 35.0 48.0 | A: 7.7 17.5 24.0 34.9 47.6
M: 8.0 17.0 23.0 34.0 46.5 | A: 8.1 17.3 23.5 34.0 46.5

TABLE I: Features: Cascade Prediction over Time and Size

Group    Feature(s) over size
$A_m$    $K(F_\theta^{m,\lambda})$, $K(\bar{F}_\theta^{m,\lambda})$, $I_G(V_\theta^m)$, $I_G(F_\theta^{m,\lambda})$, $I_G(\bar{F}_\theta^{m,\lambda})$, $O(V_\theta^m, F_\theta^{m,\lambda})$, $O(V_\theta^m, \bar{F}_\theta^{m,\lambda})$, $O(F_\theta^{m,\lambda}, \bar{F}_\theta^{m,\lambda})$, $|F_\theta^{m,\lambda}|$, $|\bar{F}_\theta^{m,\lambda}|$, $m \in \{30, 50\}$
$B_m$    Community features mentioned in [3] and $C_m$
$C_m$    $\frac{1}{m} \sum_{i=1}^{m} t_\theta^i$, $m = 50$

Overlap. We found that overlap grows with the number of adopters in the three types of overlap considered. For $O(V_\theta^m, F_\theta^{m,\lambda})$, viral cascades start with a larger initial value and keep leading non-viral ones through the diffusion process of the first 200 nodes (Fig. 1e, Fig. 1f). This may hint that viral cascades also take advantage of densely linked communities to help them become viral. However, in the case of $O(V_\theta^m, \bar{F}_\theta^{m,\lambda})$ and $O(F_\theta^{m,\lambda}, \bar{F}_\theta^{m,\lambda})$, viral cascades begin with a lower value but grow much faster than non-viral cascades.

Classification Experiments. Here we examine our experiments for predicting whether a cascade becomes viral, that is, whether its size exceeds a threshold ($TH$) of 500 adopters given that the cascade has 50 adopters ($m = 50$). Based on the distribution of the final size of cascades in this dataset, this is a binary classification task with two heavily imbalanced classes. Hence, we report performance measurements (precision, recall and F1 score) for only the minority (viral) class. Throughout the course of our experiments, we found that varying the threshold (slightly modifying the definition of “viral”) for only the training set allows for a trade-off between precision and recall. We study the trend of performance measures in two cases: (1) the threshold for the test set is maintained as $TH_{ts} = 500$ while the training threshold is varied, $TH_{tr} = \{300, 400, 500, 600, 700\}$; (2) the two thresholds are kept the same ($TH$) while we modify this value, $TH = \{300, 400, 500, 600, 700\}$.
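The two ingredients of this setup, relabeling with a threshold and scoring only the viral class, can be sketched as follows. The final sizes are made up, and the "classifier" is a stand-in that simply reproduces the labels it would have been trained on under a given training threshold; it is a toy device for showing how the metrics respond, not the model used in the paper.

```python
from typing import List, Tuple

def label(final_sizes: List[int], th: int) -> List[bool]:
    """True for cascades whose final size reaches the threshold."""
    return [n >= th for n in final_sizes]

def viral_metrics(y_true: List[bool], y_pred: List[bool]) -> Tuple[float, float, float]:
    """Precision, recall and F1 for the positive (viral) class only."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t and p)
    fp = sum(1 for t, p in zip(y_true, y_pred) if not t and p)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t and not p)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

sizes = [120, 350, 450, 520, 800]      # made-up final cascade sizes
y_test = label(sizes, th=500)          # test labels fixed at TH_ts = 500

# Stand-in predictions: labels under a looser / stricter training threshold.
loose = viral_metrics(y_test, label(sizes, th=300))
strict = viral_metrics(y_test, label(sizes, th=700))
```

In this toy case the looser training threshold yields precision 0.5 and recall 1.0, while the stricter one yields precision 1.0 and recall 0.5. The direction is built into the toy by construction; the measured behavior is what Fig. 3a reports.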

Table I shows the groups of features used in our prediction tasks. The features introduced in this paper are in group $A_m$. As a baseline method for size-based prediction (feature group $C_m$) we used average time to adoption. We also compare our features (group $A_m$) with the community features extracted in [3] (group $B_m$). This was the best performing feature set in that paper for a comparable task.² Additionally, we study the average size of recalled and non-recalled viral cascades by classifiers using features in group $A_m$. We also investigate the significance and performance of individual features and certain combinations of features introduced in this paper.

[Figure 1: panels (a)-(j); x-axis: Number of Adopters (10, 30, 50, 100, 200)]
(a) Number of communities amongst adopters ($K(V_\theta^m)$) for non-viral cascades
(b) Number of communities amongst adopters ($K(V_\theta^m)$) for viral cascades
(c) Number of communities amongst $\lambda$ frontiers ($K(F_\theta^{m,\lambda})$) for non-viral cascades
(d) Number of communities amongst $\lambda$ frontiers ($K(F_\theta^{m,\lambda})$) for viral cascades
(e) Overlap of adopters and $\lambda$ frontiers ($O(V_\theta^m, F_\theta^{m,\lambda})$) for non-viral cascades
(f) Overlap of adopters and $\lambda$ frontiers ($O(V_\theta^m, F_\theta^{m,\lambda})$) for viral cascades
(g) Gini impurity of $\lambda$ non-adopters ($I_G(\bar{F}_\theta^{m,\lambda})$) for non-viral cascades
(h) Gini impurity of $\lambda$ non-adopters ($I_G(\bar{F}_\theta^{m,\lambda})$) for viral cascades
(i) Non-viral cascades
(j) Viral cascades
Median (M) and average (A) annotations recovered from the panels, at m = 10, 30, 50, 100, 200:
M: 7.0 15.0 20.0 27.0 33.0 | A: 25.7 39.6 53.2 88.5 111.1
M: 21.0 30.0 30.0 33.5 42.5 | A: 24.3 41.7 44.4 78.7 88.6
M: 3.0 9.0 12.0 19.0 26.0 | A: 3.7 8.7 12.3 18.5 25.2
M: 7.0 13.0 17.0 22.5 31.0 | A: 6.7 12.7 16.5 22.2 29.5
M: 0.0 0.9 0.9 0.9 0.9 | A: 0.4 0.8 0.9 0.9 0.9
M: 865.9 853.1 804.5 754.6 765.2 | A: 780.3 790.1 771.0 753.8 759.9
M: 15.3 49.7 78.4 168.1 301.1 | A: 40.9 86.9 129.4 215.8 347.7
Fig. 1: Number of communities, gini impurity, overlap and average time since $t_\theta$ to adoption for $m = \{10, 30, 50, 100, 200\}$

Fig. 2: Classification results based on groups of features ($A_m$, $B_m$, $C_m$) extracted when $m = 50$ for fixed $TH_{tr} = 500$, $TH_{ts} = 500$. Error bars represent one standard deviation.

[Figure 3: panels (a)-(d)]
(a) Results for features in $A_m$ with different $TH_{tr}$.
(b) Average final size of viral cascades (recalled, mean and not recalled)
(c) Results for features in group $A_m$ when $TH_{tr}$ and $TH_{ts}$ change.
(d) Results for features in group $B_m$ when $TH_{tr}$ and $TH_{ts}$ change.
Fig. 3: Prediction results for $A_m$ when $TH_{tr}$ and $TH_{ts}$ change. Error bars represent one standard deviation.

² This was their highest-performing set of features for predicting cascades that grew from 50 to 367 and 100 to 417 reposts. We also included the baseline feature in this set as we found it improved the effectiveness of this approach.

We used ten-fold cross-validation in our experiments to ensure the results do not take any advantage of randomness in picking training and testing sets. First we carried out the prediction tasks with fixed thresholds $TH_{tr} = 500$, $TH_{ts} = 500$. Then we modified the training threshold, $TH_{tr} = \{300, 400, 500, 600, 700\}$, to show how this achieves a trade-off between precision and recall. The difference in average final size between correctly classified viral cascades and incorrectly classified ones is also monitored over $TH_{tr} = \{300, 400, 500, 600, 700\}$ to show the potential of the features to predict the exact number of adopters. Furthermore, we modified the threshold of both training and testing sets, $TH = \{300, 400, 500, 600, 700\}$, to show the robustness of our features on related classification problems. We used the oversampling method SMOTE, which generates synthetic samples for the viral class, together with a random forest classifier. Other, lesser-performing classifiers were also examined (including SVM, MLP, and other ensemble methods) and are not reported here. All results shown in this section are sample means produced by ten repeated experiments under each combination of variables.

Size-based prediction. For this task, we studied which cascades of size 50 reached a size of 500. There are 13,285 cascades that reach the size $m = 50$, and 208 of them reached the size of 500. Maintaining the threshold $TH = 500$, Fig. 2 shows that a random forest classifier trained with features in group $A_m$ can outperform the other groups. The trade-off between precision and recall can be achieved by changing the training threshold $TH_{tr}$ while maintaining the testing threshold (Fig. 3a). We also note that the average final size of viral cascades recalled by the classifier increases with the training threshold (Fig. 3b). With threshold $TH = \{300, 400, 500, 600, 700\}$ on both training and testing samples, the features of group $A_m$ consistently outperform those previously introduced ($B_m$) (Fig. 3c, Fig. 3d).
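The oversampling step mentioned above can be pictured with a simplified, dependency-free sketch of the SMOTE idea: each synthetic minority sample is placed at a random point on the segment between a minority sample and one of its nearest minority neighbors. This is an illustration of the general technique under made-up feature vectors, not the implementation used in the experiments; in practice a maintained library implementation would normally be preferred, and the function name and parameters here are hypothetical.

```python
import math
import random
from typing import List, Tuple

Point = Tuple[float, ...]

def smote(minority: List[Point], n_new: int, k: int = 2, seed: int = 0) -> List[Point]:
    """Create n_new synthetic points by interpolating toward minority neighbors.

    Requires at least two minority points.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        x = rng.choice(minority)
        # k nearest minority neighbors of x, excluding x itself
        neighbors = sorted((p for p in minority if p is not x),
                           key=lambda p: math.dist(x, p))[:k]
        nb = rng.choice(neighbors)
        gap = rng.random()  # position along the segment from x to nb
        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(x, nb)))
    return synthetic

# Made-up two-dimensional feature vectors for three viral cascades.
minority = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
new_points = smote(minority, n_new=5)
```

Because every synthetic point lies between two real minority points, the oversampled class stays inside the region already occupied by viral cascades in feature space.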


TABLE II: Weights of features assigned by randomized logistic regression models

Name      Features                                                        Weights
Overlap   $O(V_\theta^{30}, F_\theta^{30,\lambda})$                       0.50
          $O(V_\theta^{30}, \bar{F}_\theta^{30,\lambda})$                 0.04
          $O(F_\theta^{30,\lambda}, \bar{F}_\theta^{30,\lambda})$         0.23
          $O(V_\theta^{50}, F_\theta^{50,\lambda})$                       0.50
          $O(F_\theta^{50,\lambda}, \bar{F}_\theta^{50,\lambda})$         0.26
Gini      $I_G(F_\theta^{30,\lambda})$                                    0.02
          $I_G(\bar{F}_\theta^{30,\lambda})$                              0.02
          $I_G(\bar{F}_\theta^{50,\lambda})$                              0.52
Baseline  $\frac{1}{50} \sum_{i=1}^{50} t_\theta^i$                       1.00

Feature investigation. Here we investigate the importance of each feature in $A_m$. With $TH_{tr} = 500$ and $TH_{ts} = 500$, we trained 100 randomized logistic regression models, each assigning weights to the features in those sets. We then categorized the features with weight larger than 0.01 (on average) into groups such as overlap, gini impurity, etc. Then, we performed classification on the basis of single feature categories or combinations of such categories. The average weights assigned are shown in Table II. As shown by these results, overlaps can make a significant contribution to cascade prediction. Intuitively, communication between two sets of nodes is more likely to happen in their shared communities, which is consistent with the results of [?]. This implies that the larger the overlap value, the more influence one set has on the other. For example, we can infer that viral cascades tend to have a larger $O(V_\theta^m, F_\theta^{m,\lambda})$ value, so that their adopters have a larger chance of influencing the $\lambda$ frontiers than those of non-viral cascades. Moreover, the gini impurity of $\lambda$ non-adopters also shows its importance. Intuitively, non-viral cascades are more easily trapped in a relatively small number of communities. This means that even if they show up in the timelines of people with high structural diversity, they cannot get them infected.
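The selection step described above, averaging the weight each feature receives across the repeated models and keeping those above 0.01, can be sketched as follows. The feature names and per-run weights are made up, and the sketch covers only the averaging and thresholding, not the fitting of the randomized logistic regression models themselves.

```python
from typing import Dict, List

def mean_weights(runs: List[Dict[str, float]]) -> Dict[str, float]:
    """Average weight per feature across repeated model fits."""
    names = runs[0].keys()
    return {f: sum(run[f] for run in runs) / len(runs) for f in names}

def select_features(avg: Dict[str, float], cutoff: float = 0.01) -> Dict[str, float]:
    """Keep features whose average weight exceeds the cutoff."""
    return {f: w for f, w in avg.items() if w > cutoff}

# Made-up weights from two runs over two hypothetical features.
runs = [
    {"overlap_adopters_frontiers": 0.6, "num_communities_frontiers": 0.0},
    {"overlap_adopters_frontiers": 0.4, "num_communities_frontiers": 0.01},
]
avg = mean_weights(runs)
selected = select_features(avg)
```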